Deep learning predictions of TCR-epitope interactions reveal epitope-specific chains in dual alpha T cells

T cells have the ability to eliminate infected and cancer cells and play an essential role in cancer immunotherapy. T cell activation is elicited by the binding of the T cell receptor (TCR) to epitopes displayed on MHC molecules, and the TCR specificity is determined by the sequence of its α and β chains. Here, we collect and curate a dataset of 17,715 αβTCRs interacting with dozens of class I and class II epitopes. We use this curated data to develop MixTCRpred, an epitope-specific TCR-epitope interaction predictor. MixTCRpred accurately predicts TCRs recognizing several viral and cancer epitopes. MixTCRpred further provides a useful quality control tool for multiplexed single-cell TCR sequencing assays of epitope-specific T cells and pinpoints a substantial fraction of putative contaminants in public databases. Analysis of epitope-specific dual α T cells demonstrates that MixTCRpred can identify α chains mediating epitope recognition. Applying MixTCRpred to TCR repertoires from COVID-19 patients reveals enrichment of clonotypes predicted to bind an immunodominant SARS-CoV-2 epitope. Overall, MixTCRpred provides a robust tool to predict TCRs interacting with specific epitopes and interpret TCR-sequencing data from both bulk and epitope-specific T cells.


Introduction
T cells are key components of the cellular immune response, providing defense against infected and malignant cells. In cancer, inducing new T-cell responses or boosting pre-existing ones has revolutionized cancer immunotherapy treatments, providing long-term benefits to a significant fraction of patients, including some with late-stage malignancies [1][2][3].
The activation of a T cell is triggered by the binding of the T cell receptor (TCR) to antigen-derived peptides that are presented on Major Histocompatibility Complex molecules (pMHCs). TCRs have an extensive sequence diversity, with estimates ranging from 10^15 to 10^61 different TCR sequences that can potentially be generated [4][5][6]. This high diversity allows T cells to recognize a large number of epitopes displayed on different MHC alleles 7. As of today, high-throughput sequencing enables researchers to rapidly map TCR repertoires in patients 8,9. However, it remains challenging to know which TCRs target specific epitopes. This hinders the development of treatments that aim at using or engineering T cells to target specific peptides displayed on MHC molecules, such as cancer neo-epitopes 10,11.
TCRs are heterodimers composed of one α and one β chain. The TCR sequence diversity is achieved during the V(D)J recombination when a unique combination of V and J (for the α chain) and V, D and J (for the β chain) germline-encoded segments is selected and assembled. Additional diversity occurs through N- and P-nucleotide insertions at the V(D)J junctions. These regions, referred to as complementarity-determining region 3 (CDR3), are mainly involved in recognition of the epitope, while two other CDRs (CDR1 and CDR2), located on the V segments, mediate contact primarily with the MHC 12.
Most T cells express a unique α and a unique β chain 13. However, T cells expressing two in-frame-rearranged TCRα or two TCRβ chains have been observed both in Mus musculus and Homo sapiens [14][15][16][17][18][19][20]. It is currently estimated that approximately 10% of T cells can express two functionally rearranged α-chains whereas dual β chains are found in less than 1% of T cells 19,[21][22][23][24][25]. Many TCR-sequencing analysis tools disregard dual chain T cells 26 or consider the most expressed chain as the one responsible for epitope recognition 27.

bioRxiv preprint, this version posted September 16, 2023; doi: https://doi.org/10.1101/2023.09.13.557561. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Recent years have witnessed the emergence of a variety of immune assays to identify and sequence the α and β chains of epitope-specific TCRs 7,28,29. Many approaches use individual pMHC multimers to sort and sequence T cells recognizing one specific epitope 30,31. Recently, the throughput of such approaches has been expanded by taking advantage of multiplexed DNA barcoded pMHC multimers coupled with single-cell TCR/barcode-sequencing 32.
Another approach to enhance the number of epitopes that can be simultaneously analyzed consists of stimulating pools of T cells with various combinations of epitopes, and deconvolving the different pools 33. Conventional bulk sequencing methods have been applied to sequence the TCRs in epitope-specific T cell populations one chain at a time, or only the β chain. More recently, single-cell TCR-sequencing has enabled acquisition of paired αβTCR sequences. This is particularly relevant for modeling TCRs of epitope-specific T cells, since both chains ultimately determine the TCR specificity 27,[34][35][36][37]. One of the most comprehensive αβTCR sequence datasets of epitope-specific T cells 29,38 was generated by the 10X Genomics immune profiling platform, coupling DNA barcoded pMHC multimers with single-cell TCR-Seq 38. This study identified approximately 15,000 TCR clonotypes interacting with 44 epitopes in one single experiment 38. These data constitute around 70% of all paired αβTCR-epitopes currently stored in public databases 29,39. Although recently developed quality control tools suggest that not all these interactions are of equal quality 27,40,41, this type of technology is likely to play an important role in providing information about the specificity of TCRs recognizing distinct viral or cancer epitopes.
Paired TCR-pMHC sequence data have been used to train machine learning approaches that aim at predicting which T cells can target a specific pMHC directly from the TCR sequences 42,43. While some tools consider only the CDR3 region and/or the V, J segments of the β chain (e.g., TITAN 44, ATM-TCR 45, ImRex 46, pMTnet 47, TCRex 48,49), others take as input the full sequence of the TCR (i.e., the α and the β chain) 27,35,36,[50][51][52][53][54][55]. These methods range from distance-based classifiers 50,52 to machine learning or deep learning models 27,[35][36][37],49,53,54,56,57, and they all share the common underlying assumption that TCRs displaying similar sequence patterns recognize the same pMHC 50,58. Some of these tools can be used directly through command-line or web interfaces (e.g., NetTCR2.1 35, ERGO2.0 37,54), others need to be retrained (e.g., TCRAI 27, TCRGP 56), or have been benchmarked 43 but not yet released (e.g., TCRex 48,49 for αβ TCRs, SONIA 53 for TCR classification). Most TCR-epitope interaction predictors have been trained and tested for predicting TCRs recognizing specific pMHCs with at least some known TCRs (referred to as epitope-specific predictions), although some tools include in theory the possibility to make predictions for TCRs recognizing any epitope (referred to as pan-epitope predictions). Due to different training data and procedures, it is challenging to compare the advantages and disadvantages of each approach. To address this issue, a public benchmark for TCR-pMHC predictions was recently introduced 43.
Several conclusions can be drawn from these studies. First, using paired α and β chains is important for modeling TCR specificity 27,[34][35][36][37],43 and predictors that rely on one of the chains (usually the β chain) have been shown to be less accurate than methods using both chains 43. Second, algorithms employing different approaches, from distance-based to deep learning methods, have comparable performance 43. Third, accurate predictions require a minimum number of TCRs interacting with a specific pMHC. This demonstrates that a key determinant of TCR-pMHC interaction predictions is the quality and epitope coverage of the training set. It further suggests that extrapolating these predictions to pMHCs without known interacting TCRs is challenging 35,43.
In this study, we collected and curated a large dataset of paired αβTCR sequences coupled with their cognate pMHC. We leveraged these data to develop a sequence-based predictor of TCR-pMHC interaction, referred to as MixTCRpred (Figure 1A). We show that MixTCRpred accurately predicts TCRs binding to several known viral and cancer epitopes, outlines the extent to which predictions can be extended to new epitopes, serves as a valuable quality control tool for identifying putative contaminants in existing databases, allows accurate annotation of epitope-specific chains in dual α T cells, and reveals enrichment of TCRs predicted to recognize an immunodominant class II epitope in TCR repertoires of COVID-19 patients (Figure 1A).

Results

Integration and curation of αβTCR-pMHCs interactions reveal binding specificities for dozens of class I and class II epitopes
To improve our understanding of the specificity of TCRs for different epitopes, we collected sequences of αβTCRs targeting specific pMHCs from several public databases, including VDJdb 29, IEDB 39 and the McPAS database 59 (Figure 1B and Methods). TCR-pMHC sequence data from the 10X Genomics immune profiling assay 38 were processed separately to include only cases with a clear signal from one unique pMHC multimer (see Methods). We further collected and curated TCRs isolated from Mus musculus infected with lymphocytic choriomeningitis virus (LCMV) from two recent studies 60,61.
Duplicated TCR-pMHC pairs were removed based on V/J gene usage as well as the same CDR3 sequence for both the α and β chain. This led to a total of 20,279 distinct TCR sequences interacting with 1253 pMHCs (Figure 1B). For the majority of pMHCs only one or a few TCRs have been experimentally validated, leading to a heavily skewed TCR distribution (Figure 1C-D). For further analysis, only pMHCs with ten or more binding TCRs were considered, resulting in a total of 17,715 αβTCRs interacting with 146 pMHCs (Figure 1C and Table S1). As expected, most of the data relate to peptides presented by human MHCs (127 out of 146 pMHCs), with a large fraction of HLA-A*02:01 restricted peptides (Figure 1E).
In contrast, only 19 pMHCs in Mus musculus have been extensively characterized in terms of TCRs. One important case is the class II LCMV-derived peptide DIYKGVYQFKSV, restricted to H2-IAb, for which 3650 different αβTCRs obtained from 11 LCMV-infected mice were available from two different studies 60,61.
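The curation steps described above (removing duplicated TCR-pMHC pairs, then keeping pMHCs with ten or more TCRs) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the record layout and the field names (`pmhc`, `va`, `ja`, `cdr3a`, `vb`, `jb`, `cdr3b`) are hypothetical.

```python
from collections import defaultdict


def clonotype_key(rec):
    """Identity of a TCR-pMHC pair: pMHC, V/J genes and CDR3 of both chains.

    The field names are hypothetical and chosen for this sketch only.
    """
    return (rec["pmhc"], rec["va"], rec["ja"], rec["cdr3a"],
            rec["vb"], rec["jb"], rec["cdr3b"])


def deduplicate(records):
    """Keep the first occurrence of each distinct TCR-pMHC pair."""
    seen = set()
    unique = []
    for rec in records:
        key = clonotype_key(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def keep_well_covered(records, min_tcrs=10):
    """Group records by pMHC and keep pMHCs with at least `min_tcrs` TCRs."""
    by_pmhc = defaultdict(list)
    for rec in records:
        by_pmhc[rec["pmhc"]].append(rec)
    return {p: recs for p, recs in by_pmhc.items() if len(recs) >= min_tcrs}
```

Calling `keep_well_covered(deduplicate(records))` returns a dictionary mapping each retained pMHC to its list of distinct TCRs.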

MixTCRpred accurately predicts TCRs recognizing specific pMHCs
We used the collected data to train and validate MixTCRpred, a machine learning predictor of TCR-pMHC interactions (Figure 2A). MixTCRpred is a pMHC-specific predictor, where a separate model is trained for each epitope. Negative data were computationally generated by sampling TCRs with different or undetermined specificity (see Methods). For each pMHC we chose a 1:5 ratio between positives (epitope-specific TCRs) and negatives (non-binding TCRs).
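The 1:5 positive-to-negative ratio can be illustrated with a short sampling routine. This is a sketch under stated assumptions: the function name, the representation of a TCR as a hashable value, and uniform sampling without replacement are illustrative choices, not necessarily the procedure described in Methods.

```python
import random


def sample_negatives(positives, background, ratio=5, seed=0):
    """Draw `ratio` negatives per positive from a background TCR pool.

    `background` stands for TCRs with different or undetermined specificity.
    TCRs must be hashable (for example strings or tuples). Any TCR that is
    also a positive for the current pMHC is excluded from the pool.
    """
    positive_set = set(positives)
    candidates = [tcr for tcr in background if tcr not in positive_set]
    n_wanted = ratio * len(positives)
    if len(candidates) < n_wanted:
        raise ValueError("background pool too small for the requested ratio")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(candidates, n_wanted)
```

A fixed seed keeps the negative set stable between runs, which makes cross-validation results easier to compare.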
The architecture of MixTCRpred is depicted in Figure 2A. From an input TCR, which is conventionally provided as V, J genes and CDR3 sequences for both chains, MixTCRpred first extracts the CDR1 and CDR2 sequences (from the V genes) as defined by the International ImMunoGeneTics Information System 62. The CDR1, CDR2 and CDR3 sequences are then padded and concatenated for the α and β chains separately. After numerical embedding, a transformer encoder 63 is used to identify the statistical patterns underlying the TCR specificity (see Methods). The final step is a dense classification layer whose output score indicates how likely the TCR is to interact with a specific pMHC. Comparing raw scores across different epitopes is challenging due to the inherent biases of each model. To make the predictions more interpretable and comparable, we developed a robust framework to compute the %rank, which indicates how the raw score of a given TCR compares with that of a large set of randomly generated TCRs (see Methods and Supplementary Figure 1).
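The idea behind the %rank can be sketched as a percentile computed against background scores. The exact procedure is given in Methods and is not reproduced here. The sketch below assumes that higher raw scores mean stronger predicted binding, and that the %rank is the percentage of background TCRs scoring at least as high as the query, so lower values are better.

```python
from bisect import bisect_left


def percent_rank(score, background_scores_sorted):
    """Percentage of background TCRs scoring at least as high as `score`.

    `background_scores_sorted` must be sorted in ascending order and holds
    the raw scores of a large set of random TCRs for the same pMHC model.
    This definition is an assumption made for illustration.
    """
    n = len(background_scores_sorted)
    # bisect_left returns the number of background scores strictly below `score`.
    n_at_least = n - bisect_left(background_scores_sorted, score)
    return 100.0 * n_at_least / n
```

Because each model is ranked against its own background, a %rank of 1 has the same meaning for every epitope, whereas raw scores do not.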
To explore the accuracy of our predictions and the impact of the size of the training set, we first performed a standard 5-fold cross-validation for each of the 146 MixTCRpred models. Performance was assessed with the area under the receiver operating characteristic curve (AUC) (Table S1). As shown in Figure 2B, MixTCRpred models achieved robust predictions for pMHCs with a large number of interacting TCRs: out of 43 pMHCs with more than 50 TCRs, 40 had an average AUC >0.7 and 34 of them an AUC >0.8. Lower accuracy was observed for several pMHCs with fewer TCRs.
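For readers less familiar with the metric, the AUC equals the probability that a randomly chosen positive TCR receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition and adds a simple 5-fold split. The function names and the interleaved fold assignment are illustrative assumptions, not the authors' code.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) for k interleaved folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]
```

The pairwise loop is quadratic in the number of TCRs, which is acceptable for a sketch; a rank-based formula gives the same value faster on large sets.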
We next performed a leave-one-sample-out cross-validation to determine whether MixTCRpred predictions were consistent across samples within the same study. 15 epitopes had been analyzed in multiple samples, each of which had at least 10 TCRs. Figure 2C shows that in the majority of the cases, MixTCRpred was successful in predicting epitope-specific TCRs in a new sample, with leave-one-sample-out performances similar to that of a 5-fold cross-validation. To assess whether predictions could be transferred across different studies, we performed a leave-one-study-out validation, including epitopes with at least 10 TCRs per study. Overall, we observed only limited loss in predictive power with respect to the AUCs of 5-fold cross-validations on the same datasets (Figure 2D). This demonstrates that MixTCRpred predictions are robust across studies, and could be applied to new studies for the epitopes considered in its training set. This conserved predictive power could result from either conservation of TCR sequence patterns captured by MixTCRpred, or the presence of public clones that are found across different samples/studies. To shed light on this question, we focused on TCRs specific for the class II H2-IAb, DIYKGVYQFKSV epitope from LCMV-infected Mus musculus samples for which multiple samples from 2 studies were available (Supplementary Figure 2). Out of a total of 11 samples, we observed that most TCRs are unique to one sample (approximately 98% of the epitope-specific TCR repertoire), and only a limited number of TCR sequences are shared across two or more samples (Supplementary Figure 2B-2C). Despite this low clonal overlap, the CDR3α and CDR3β display similar motifs across different samples and studies (Supplementary Figure 2D). These epitope-specific T cells came from mono-allelic Mus musculus strains raised in controlled conditions. Therefore, most of the observed TCR variability is attributable to the fact that a large number of different TCRs can be generated to target the same pMHC, as long as they satisfy the statistical constraints reflected by the sequence patterns and captured by MixTCRpred. Similar results were obtained for other epitopes (Supplementary Figure 3).

Next, we benchmarked MixTCRpred with other publicly available tools that take αβ TCRs as input. First, we evaluated the performance of our predictor against the other available pre-trained models accessible through command-line or web interfaces, i.e., NetTCR2.1 35, ERGO2.0 (AE), ERGO2.0 (LSTM) 37,54 and tcrdist3 50 (see Methods). To this end, we used the McPAS database as a test set, which is not part of the training dataset of most tools considered in this validation with the exception of NetTCR2.1. MixTCRpred was retrained excluding data from this database as well as overlap in other databases (see Methods).
Comparison of the performance was done for the set of pMHCs that were supported by each tool in their pre-trained version. Our results demonstrate that MixTCRpred consistently outperforms other available tools for Homo sapiens and Mus musculus pMHCs (Figure 2E).
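The leave-one-sample-out and leave-one-study-out validations described earlier in this section share the same splitting logic: all TCRs from one group (a sample or a study) are held out, and the model is trained on the rest. A minimal sketch follows. The function name is hypothetical, and training on all other groups, including those too small to be held out themselves, is an assumption of this sketch.

```python
from collections import defaultdict


def leave_one_group_out(groups, min_per_group=10):
    """Yield (held_out_group, train_indices, test_indices).

    `groups[i]` is the sample or study label of the i-th TCR of one epitope.
    Only groups with at least `min_per_group` TCRs are used as test sets.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    for g, test_idx in by_group.items():
        if len(test_idx) < min_per_group:
            continue
        train_idx = [i for i, other in enumerate(groups) if other != g]
        yield g, train_idx, test_idx
```

Passing sample labels gives the leave-one-sample-out splits, and passing study labels gives the leave-one-study-out splits.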
To extend our benchmark to other methods, including some that have not yet been released, we capitalized on the recent IMMREP22 dataset 43 consisting of curated data for 17 peptide-MHCs, each having at least 50 unique validated binding αβTCR sequences. This dataset was specifically collected to benchmark the algorithms behind TCR-pMHC interaction predictors (i.e., using the same training and test sets for all methods). Upon retraining our tool on the same training set as all other tools, we observed that MixTCRpred achieved similar or higher accuracy on the test set (median AUC of 0.891, Figure 2F). This indicates that the architecture of MixTCRpred provides state-of-the-art performance, even when not considering our efforts to enhance and curate the training set.
We further used the IMMREP22 dataset to assess the role of the CDR1 and CDR2 sequences in MixTCRpred (Supplementary Figure 4A). We observed that predictions with the CDR1, CDR2, and CDR3 as input features are more accurate than those obtained using only the CDR3 for most pMHCs (Supplementary Figure 4B), which is consistent with previous observations 35,43. Overall, our results show that MixTCRpred achieves robust predictions for pMHCs for which several interacting TCRs have been experimentally determined.

MixTCRpred reveals how much predictions can be extended to unseen epitopes
To investigate whether predictions may be extended to epitopes not present in the training set, we adapted the MixTCRpred architecture to incorporate both the peptide and TCR sequences as inputs, resulting in a so-called pan-epitope predictor. This involved adding an extra embedding and transformer encoder layer for the epitope sequence, and concatenating it with the TCR before the final classification layer (Supplementary Figure 5).
By doing so, the model is in theory able to learn correlation patterns between the TCR and epitope sequences, and potentially predict TCRs binding to epitopes without any known TCR (i.e., unseen epitopes) 35. To avoid overly complex models, we trained a separate pan-epitope MixTCRpred model for each MHC allele.
To evaluate the performance of the pan-epitope version of MixTCRpred, we first performed a 5-fold cross-validation to predict TCRs interacting with epitopes already present in the training set. The pan-epitope predictor demonstrated performances similar to or lower than those of the pMHC-specific MixTCRpred predictor (Figure 3A and Supplementary Figure 6A). The lower prediction accuracy of the pan-epitope model was especially significant for epitopes with more than 50 TCRs (Figure 3B and Supplementary Figure 6B). Overall, this indicates that incorporating all available TCR-epitope pairs in the training of MixTCRpred is less effective for TCR-epitope predictions than training specific models for each epitope, thereby supporting our choice of an epitope-specific architecture in the final version of MixTCRpred.
These results are consistent with previous studies 35. Next, we investigated the ability of the pan-epitope model to predict TCRs interacting with unseen epitopes, by performing a leave-one-epitope-out validation. This analysis revealed limited accuracies, with a median AUC of 0.59 (Figure 3C). Out of 16 cases with AUCs > 0.8, 11 of them had an epitope in the training set differing by only one amino acid (Figure 3D).
We next computed the sequence similarity between each pair of epitopes, restricting to epitopes presented by the same MHC allele (similarity of 1 corresponds to identical epitopes, see Methods). When the test epitope in the leave-one-epitope-out validation had high sequence similarity with one of the epitopes in the training set, the pan-epitope predictor gave in general better than random predictions (Figure 3D). As the similarity between the test epitope and those in the training set decreases, predictions become close to random (Figure 3D). Overall, this suggests a model where predictions can be transferred to new epitopes almost only if TCRs binding to highly similar epitopes are known. As an example of this observation, TCRs binding to HLA-A*02:01, ELAGIGILTV exhibited similar motifs to those binding to HLA-A*02:01, EAAGIGILTV (the two epitopes differ only at the unexposed HLA anchor position; BLOSUM similarity of 0.89), whereas the similarity was lower with more different epitopes, such as HLA-A*02:01, KLVALGINAV (BLOSUM similarity of 0.39) (Figure 3E). Including ELAGIGILTV TCR sequences in the model could be informative to predict EAAGIGILTV TCRs, but not KLVALGINAV TCRs.
To evaluate the likelihood of a given epitope to show high enough similarity to one epitope in the training set of MixTCRpred, we collected all T cell epitopes in IEDB 39. We observed that less than 0.03% have sequence similarity higher than 0.8 (Supplementary Figure 7). Overall, our results show that extending predictions to unseen epitopes is challenging with the current amount of TCR-pMHC sequence data, with successful predictions possible only when the unseen epitope has high sequence similarity with at least one epitope in the training set of MixTCRpred. These findings align with observations from previous studies 42,64,65, further supporting the fact that generalizing predictions to any epitope is currently an unmet challenge, irrespective of the architecture of the algorithm that is used.
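A quick way to check whether an unseen epitope has a near neighbor in a training set is to count mismatches against same-length training epitopes. The sketch below uses a plain Hamming distance, which is a deliberate simplification and not the BLOSUM-based similarity defined in Methods. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must have the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_training_epitope(query, training_epitopes):
    """Return (epitope, n_mismatches) for the nearest same-length epitope.

    Returns None when no training epitope of the same length is available.
    Training epitopes are assumed to be restricted to the same MHC allele.
    """
    same_length = [e for e in training_epitopes
                   if len(e) == len(query) and e != query]
    if not same_length:
        return None
    best = min(same_length, key=lambda e: hamming(query, e))
    return best, hamming(query, best)
```

With the epitopes discussed above, EAAGIGILTV is one mismatch away from ELAGIGILTV, whereas KLVALGINAV differs from ELAGIGILTV at six positions.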

MixTCRpred provides a quality control tool for scTCR-Seq data of T cells labeled with DNA-barcoded pMHC multimers
DNA-barcoded pMHC multimers provide a powerful way to simultaneously label T cells recognizing distinct epitopes. This approach was recently used by 10X Genomics to identify and sequence T cells specific for 44 different epitopes 38. In this assay, the binding specificity of a T cell was determined by the DNA-barcoded pMHC multimer with the highest UMI counts across all possible multimers (Figure 4A). However, TCR-epitope interaction predictors trained on this dataset in general achieved poor performances 35,40. To shed light on this issue, we calculated for each cell the fraction of UMIs specific to the pMHC multimer with the highest counts, hereinafter F_max (Figure 4A and Methods). This revealed high variability of F_max distributions across donors and epitopes (Figure 4B). For most cases with high MixTCRpred AUC, we observed that the MHC alleles used in the multimers were also found in the corresponding donors (points with a black border in Figure 4C, donor HLAs in Supplementary Table 2) 40. One exception consists of donor 4 where the IVTDFSVIK epitope in complex with HLA-A*11:01 was used. This donor was HLA-A*03:01 positive and the two alleles show highly similar motifs (Supplementary Figure 8), suggesting that TCRs isolated with the HLA-A*11:01 multimer may be cross-reactive with the same epitope in complex with HLA-A*03:01. Overall, our findings show that MixTCRpred is a valuable quality control tool for single-cell TCR-Seq data of epitope-specific T cells labeled with barcoded pMHC multimers in different donors.
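The F_max quantity defined above is simple to compute for one cell. The sketch below assumes that each cell comes with a mapping from pMHC multimer to UMI count. The function name and the handling of cells without any UMI are illustrative choices, and any additional filtering described in Methods is not reproduced.

```python
def f_max(umi_counts):
    """Return (top_pmhc, F_max) for one cell.

    `umi_counts` maps each pMHC multimer to its UMI count in that cell.
    F_max is the fraction of all UMIs that belong to the top multimer.
    """
    total = sum(umi_counts.values())
    if total == 0:
        return None, 0.0  # no multimer signal in this cell
    top = max(umi_counts, key=umi_counts.get)
    return top, umi_counts[top] / total
```

A cell with counts of 8, 1 and 1 for three multimers has F_max = 0.8, whereas a cell with evenly spread counts has a low F_max and a less reliable specificity assignment.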

MixTCRpred reveals epitope-specific chains in epitope-specific dual α T cells
Approximately 10% of T cells express two distinct α chains on the cell surface 24. Many approaches assume that the chain with higher expression (higher UMI counts, or higher read counts if both chains have the same UMI count) is the one mediating epitope recognition 27,29. To investigate the validity of this assumption, we focused on the 10X Genomics dataset 38 and selected two pMHCs with a large number of αβTCRs, namely HLA-A*02:01, GILGFVFTL (839 αβTCRs) and HLA-A*02:01, ELAGIGILTV (169 αβTCRs). Next, we retrieved T cells expressing two α and one β chains (152 for HLA-A*02:01, GILGFVFTL, and 18 for HLA-A*02:01, ELAGIGILTV) after filtering out doublets (see Methods).
For each epitope, we trained a specific MixTCRpred model with the αβTCR sequences from single α T cells. We then used it to predict the binding of TCRs from dual α T cells (αxαy-β), by considering each α chain separately (αx-β and αy-β TCRs, see Figure 5A). In most cases, the two α chains exhibited significantly different MixTCRpred %ranks (Figure 5B). The best predicted MixTCRpred binder coincided with the α chain with higher expression in approximately 60% of the dual α T cells (85 cases out of 152 for HLA-A*02:01, GILGFVFTL, and 11 out of 18 for HLA-A*02:01, ELAGIGILTV) (Supplementary Figure 9A).
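The comparison between the two selection rules can be sketched as follows, assuming that each dual α T cell carries a %rank and a UMI count for both α chains and that lower %rank means stronger predicted binding. The field names are hypothetical, and the read-count tie-break mentioned above for equal UMI counts is omitted for brevity.

```python
def pick_alpha_by_rank(cell):
    """Return 'x' or 'y' for the alpha chain with the lower (better) %rank."""
    return "x" if cell["rank_x"] <= cell["rank_y"] else "y"


def pick_alpha_by_umi(cell):
    """Return 'x' or 'y' for the alpha chain with the higher UMI count."""
    return "x" if cell["umi_x"] >= cell["umi_y"] else "y"


def agreement(cells):
    """Fraction of cells where both rules select the same alpha chain."""
    same = sum(pick_alpha_by_rank(c) == pick_alpha_by_umi(c) for c in cells)
    return same / len(cells)
```

For reference, the counts reported above correspond to agreements of 85/152 (about 56%) and 11/18 (about 61%).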
To validate that the α chain with the best MixTCRpred prediction (hereinafter chain α1) was the one involved in epitope recognition, we selected a set of 6 dual α T cells for HLA-A*02:01, GILGFVFTL and 5 for HLA-A*02:01, ELAGIGILTV (Table 1 and Supplementary Figure 10). RNA encoding α1βTCR and α2βTCR from these dual α T cells was synthesized and electroporated into TCR-Jurkat cells. After overnight incubation, TCR transfected cells were interrogated by pMHC-multimer staining (see Methods). Figure 5C demonstrates that α chains predicted by MixTCRpred were binding to the pMHCs in all cases, while the other chains did not bind. Similar results were not obtained with predictions based on the highest UMI count, which identified the correct epitope-specific α chains only in 50% and 60% of the tested cases (Figure 5D and Supplementary Figure 11).
An alternative approach to identify epitope-specific α chains is to look for exact matches of the αx-β or αy-β TCRs in our comprehensive database of single α T cells. Exact matches could be found for 96 cases for HLA-A*02:01, GILGFVFTL, including only 3 cases (TCR-1, -4, and -5) for the TCR sequences experimentally tested, and none for the HLA-A*02:01, ELAGIGILTV epitope (Figure 5D and Supplementary Figure 8B).

protein and restricted to HLA-DPB1*04:01) in TCR repertoires. We collected αβTCR repertoires of CD4+ T cells from multiple studies that isolated T cells from the peripheral blood of both COVID-19-positive and COVID-19-negative patients [66][67][68]. This collection included T cells that were stimulated with a range of SARS-CoV-2 proteins 66,67, as well as T cells that were sequenced directly ex vivo without any stimulation 68, with a total of 205,930 CD4+ T cells from 138 COVID-19-positive patients and 46 healthy donors (see Methods). The HLA alleles of the patients were not provided.
Next, we calculated the proportion of TCRs that were predicted to target the TFEYVSQPFLMDLE epitope within each TCR repertoire. A threshold of 0.1% was used on MixTCRpred %rank (see Methods). Across all three studies we observed an enrichment and overall higher fraction of CD4+ T cells predicted to target this immunodominant epitope in repertoires from COVID-19-positive patients (Figure 6A-B). Among our predictions, we also observed several cases of expanded CD4+ T cells (Figure 6C). The overall ratio of T cells predicted to be TFEYVSQPFLMDLE specific was particularly pronounced in samples from the Bacher et al. study 66. In this study CD4+ T cells were stimulated with peptides from the SARS-CoV-2 spike protein that included the TFEYVSQPFLMDLE peptide. Conversely, in the Meckiff et al. study 67 a different peptide pool was used for stimulation, which did not encompass the TFEYVSQPFLMDLE peptide 69, and the overall enrichment was less prominent.
The expanded clones were less frequent in unstimulated cells from PBMC, as expected for epitope-specific CD4 T cells.
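The per-repertoire quantity used above, the proportion of TCRs passing the %rank threshold of 0.1, can be sketched as follows. The function name is hypothetical, and using a strict inequality at the threshold is an assumption of this sketch.

```python
def fraction_predicted(percent_ranks, threshold=0.1):
    """Fraction of TCRs in one repertoire with %rank below `threshold`.

    `percent_ranks` holds the %rank of every TCR in the repertoire for the
    epitope-specific model of interest (lower values mean better binders).
    """
    if not percent_ranks:
        return 0.0
    hits = sum(1 for r in percent_ranks if r < threshold)
    return hits / len(percent_ranks)
```

Computing this fraction for each donor and comparing the distributions between COVID-19-positive and COVID-19-negative groups reproduces the type of comparison summarized in Figure 6A-B.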
Our results indicate that MixTCRpred offers a robust framework for in-silico analysis of epitope-specific T cells directly from TCR repertoires, and reveals enrichment of T cells predicted to be specific for the immunodominant DPB1*04:01, TFEYVSQPFLMDLE epitope in COVID-19-positive patients.
repertoire data. In this work, we trained a predictor of TCR-pMHC interactions for a set of epitopes with enough known TCRs. These epitopes cover many common viruses, as well as some cancer antigens.
Our work indicates that reasonable prediction accuracy requires roughly 50 αβTCRs. These results provide a strong motivation for global initiatives to collect many TCRs recognizing diverse epitopes, and we anticipate that both the data collected and curated in this work and the MixTCRpred framework will contribute to this global endeavor.
Several existing tools have attempted to extrapolate predictions to other epitopes, including novel epitopes without any known TCRs, by considering both the TCR and the epitope sequences in the input of their machine learning framework. While some approaches have reported some success 47, other studies have reached the conclusion that extending predictions to any epitope is currently not feasible 64,65. Our results, which align with previous findings 45,46,70, suggest that cases where predictions could be extrapolated to new epitopes consist mostly of epitopes having high similarity to an epitope in the training set. As of today, these cases represent only a small minority of all possible epitopes. Moreover, we cannot fully exclude some level of circularity, since TCRs tested with a given epitope may have been selected based on their similarity with TCRs interacting with another highly similar epitope. These observations indicate that extrapolation to any epitope is still an unsolved challenge. Our work also suggests that mixing TCRs interacting with different epitopes when training TCR-pMHC interaction predictors in general does not improve and may even lower the prediction accuracy, thereby justifying our choice to train a separate predictor for each epitope.
Application of MixTCRpred to data generated with the 10X Genomics immune profiling platform 38 suggests that several cases (i.e., pMHC-donor pairs) may include a substantial fraction of contaminants. This observation has important consequences since 10X Genomics data currently constitute 70% of all paired αβTCR-epitopes in VDJdb 28 and 66% in IEDB 39.
Filtering out putative contaminants led to the removal of roughly 85% of the 10X Genomics data, which is consistent with estimates of contaminants in other studies 27,40. Some of these studies include HLA mismatch between the donor and the pMHC multimers in the filtering 40. Our results support this approach but suggest keeping cases of mismatched HLAs with similar motifs (e.g., HLA-A*03:01 and HLA-A*11:01).
The remaining TCR-pMHC pairs demonstrate consistent TCR sequence patterns and good internal AUC. These data were highly valuable for improving and expanding the epitope coverage of MixTCRpred. We anticipate that improvements in both DNA barcoded multimer technology and post-processing tools will enable researchers to collect large amounts of TCR-pMHC interactions in the near future with this technology. Such data will be instrumental to characterize the TCR specificity of viral or cancer epitopes, and possibly one day for training TCR-pMHC interaction predictors for any epitope.
Our work shows that MixTCRpred can accurately identify the epitope-specific chain in dual α T cells recognizing specific epitopes, even when this chain did not have the highest UMI count and was not present among the single α chain T cells. These results have important implications for processing of single-cell TCR-sequencing data of epitope-specific T cells. For instance, several entries in VDJdb 29 reported the α chain with the highest UMI count, while the MixTCRpred score was much higher for the other α chain. These include the sequences CGTEWEARLMF (TCR-2) and CAVFFEGGATNKLIF (TCR-6) for HLA-A*02:01, GILGFVFTL, and CAETTGALYSGAGSYQLTF (TCR-3) and CAEKDDKIIF (TCR-5) for HLA-A*02:01, ELAGIGILTV, which we experimentally validated as non-binders. TCR-pMHC interaction prediction tools like MixTCRpred could help address this issue and improve the quality of data stored in these databases. The clear differences observed between the scores of the two chains in dual α T cells indicate that most of these cases were not doublets and that, in general, only one α chain is responsible for the epitope specificity; cases where both α chains recognize the same epitope appear to be very rare. Our observations further suggest using the highest-UMI criterion with caution, also when dealing with dual α T cells of unknown specificity.
The enrichment and expansion of TCRs predicted to target an immunodominant SARS-CoV-2 epitope in TCR repertoires from COVID-19-positive patients, even without the possibility of stratifying patients based on their HLA alleles, suggest that many of these clonotypes are specific for the TFEYVSQPFLMDLE SARS-CoV-2 epitope. This is consistent with the fact that this epitope is immunodominant 69 and restricted to a frequent HLA-DP allele (i.e., HLA-DPB1*04:01, found in >40% in diverse populations 71). Moreover, the HLA-DPB1*04:01 motif has high similarity with many other HLA-DP alleles (Supplementary Figure 12). TCRs predicted to target TFEYVSQPFLMDLE in COVID-19-negative donors or in COVID-19-positive donors with incompatible alleles may represent TCRs that are cross-reactive with other epitopes, as expected from previous observations of SARS-CoV-2-reactive TCRs in patients before COVID-19 infection 72,73.
In summary, our study provides a high-quality dataset of TCR-pMHC interactions for several common viral and cancer epitopes (Supplementary data) as well as a robust command-line tool (https://github.com/GfellerLab/MixTCRpred) to predict new TCRs binding to these epitopes. Beyond computational screening of TCR repertoires, our work shows that MixTCRpred can be used as a quality control tool for single-cell TCR-sequencing data of T cells labeled with DNA-barcoded multimers, as well as to annotate α chains mediating epitope recognition in epitope-specific dual α T cells. Considering the rapid development of technologies to isolate and sequence epitope-specific T cells, we anticipate that the epitope coverage of TCR-pMHC interaction predictors will keep increasing, making such tools relevant for in-silico identification of TCRs recognizing known viral or cancer epitopes directly from TCR repertoires, as demonstrated for the SARS-CoV-2 epitope analyzed in this work.

TCR-epitope sequence data
TCR-pMHC pairs were collected from publicly available datasets, including VDJdb 29 (data download 27/10/2022), IEDB 39 (data download 02/11/2022) and the McPAS database 59 (data download 27/10/2022). TCR-pMHC pairs from the 10X Genomics dataset 38 were processed separately. Additional data for Mus musculus were retrieved from two recent studies 60,61. Only paired TCR sequences (with both the α and the β chain sequences) were considered, and sequences containing non-standard amino acids were removed. Duplicated TCR-pMHC pairs were merged based on V/J gene usage and CDR3 sequence for both the α and β chains.
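The curation rules above (paired chains only, standard amino acids only, duplicates merged on V/J genes and CDR3 of both chains) can be sketched as follows. This is an illustrative outline, not the code used in the study: the `PairedTCR` record, its field names and the `curate` function are hypothetical, and only CDR3 sequences are checked for non-standard residues.

```python
from typing import NamedTuple

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


class PairedTCR(NamedTuple):
    # Hypothetical record layout; field names are assumptions.
    epitope: str
    mhc: str
    va: str
    ja: str
    cdr3a: str
    vb: str
    jb: str
    cdr3b: str


def is_valid(tcr: PairedTCR) -> bool:
    """Require both CDR3s to be present and made of standard residues."""
    for cdr3 in (tcr.cdr3a, tcr.cdr3b):
        if not cdr3 or not set(cdr3) <= STANDARD_AA:
            return False
    return True


def curate(records):
    """Drop invalid entries and merge exact duplicates, keeping input order.

    Two entries are duplicates when pMHC, V/J genes and CDR3 sequences of
    both chains are identical (the whole tuple is used as the key).
    """
    seen = set()
    curated = []
    for tcr in records:
        if not is_valid(tcr) or tcr in seen:
            continue
        seen.add(tcr)
        curated.append(tcr)
    return curated
```

In practice, entries from different databases often differ in gene nomenclature, so V/J names would likely need to be harmonized before such a merge; that step is not shown here.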

Pre-processing single-cell dataset
Multiple datasets used in this study were generated using the Chromium platform of 10X Genomics 38,60,61,74 and processed with the Cell Ranger Single Cell Software Suite by 10X Genomics.
To ensure high-quality data, standard quality control on transcriptomic data was performed by:
a. Filtering out cells with low/high UMIs (<1500 or >15000 UMIs and remove top/bottom 1%)
b. Filtering out cells with low number of genes (<700 UMIs and remove top/bottom 1%)
c. Filtering out cells with high mitochondrial/ribosomal content (keeping cells with <10% mitochondrial genes and <50% ribosomal genes)
The analysis was done with the scanpy library 75. The scirpy library 76 was used to integrate TCR sequences with transcriptomics data and to identify T cells with multiple chains. Doublets were identified and removed with the scrublet package 77.
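The fixed thresholds listed above can be illustrated with a small, library-independent sketch. It is not the scanpy pipeline used in the study: the `CellQC` record and `passes_qc` function are hypothetical, the 700 cutoff is assumed to apply to the number of detected genes, and the removal of the top/bottom 1% of cells is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    # Hypothetical per-cell summary; field names are assumptions.
    barcode: str
    n_umis: int      # total UMI counts in the cell
    n_genes: int     # number of detected genes
    pct_mito: float  # % of counts from mitochondrial genes
    pct_ribo: float  # % of counts from ribosomal genes


def passes_qc(cell: CellQC) -> bool:
    """Apply the fixed thresholds only (no percentile trimming)."""
    return (
        1500 <= cell.n_umis <= 15000
        and cell.n_genes >= 700
        and cell.pct_mito < 10.0
        and cell.pct_ribo < 50.0
    )


def filter_cells(cells):
    """Return the cells that pass all fixed thresholds."""
    return [c for c in cells if passes_qc(c)]
```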

10X Genomics dataset
For the 10X Genomics dataset 38, after standard single-cell dataset preprocessing, one additional step is required to match each TCR with the cognate DNA-barcoded multimer.
Following the guidelines outlined in the 10X Genomics documentation, cells with fewer than 10 multimer UMI counts were filtered out. Additionally, cells were removed if their UMI counts for a specific multimer were not significantly higher than the UMI counts for negative control multimers (at least 5 times greater than the negative controls). Finally, cells were excluded if they had UMI counts for more than 5 different multimers. Each remaining cell was then matched to the multimer with the highest UMI counts, which was attributed as the specificity of the T cell (a total of 67,084 epitope-specific T cells and 14,887 αβTCR clonotypes). 50,625 of them were αβ T cells (10,376
- sampling TCRs specific to other pMHCs (negative/positive ratio of 1:1). To prevent the model from learning biased patterns due to the imbalanced distribution of TCR-pMHC sequence data (Figure 1C-D), a weight was assigned to each sequence before sampling. This weight was calculated as the reciprocal of the total number of TCRs that bind to the corresponding epitope. Homo sapiens and Mus musculus epitopes were treated separately.
- sampling TCRs from TCR repertoires (negative/positive ratio of 4:1). Homo sapiens αβTCR repertoires were downloaded from iReceptor 9, and from two recent studies for Mus musculus epitopes 9,74. When studying specifically the 10X Genomics dataset (leave-one-sample out, MixTCRpred for quality control of the 10X Genomics dataset, MixTCRpred to investigate dual α T cells), TCRs not assigned to any epitopes (UMI counts = 0) were used as negatives 35,36.
As a result, the final dataset had a negative to positive ratio of 5:1.
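The two sources of negatives and the inverse-frequency weighting can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the flat list inputs are hypothetical, negatives from other pMHCs are drawn with replacement, and repertoire TCRs are drawn without replacement.

```python
import random
from collections import Counter


def epitope_balanced_weights(epitopes):
    """Weight each TCR by 1 / (number of TCRs binding the same epitope)."""
    counts = Counter(epitopes)
    return [1.0 / counts[ep] for ep in epitopes]


def sample_negatives(target, tcrs, epitopes, repertoire, n_pos, seed=0):
    """Return (negatives from other pMHCs, negatives from repertoires).

    n_pos negatives come from TCRs specific to other epitopes (1:1) and
    4 * n_pos from background repertoires (4:1), giving 5:1 overall.
    """
    rng = random.Random(seed)
    pool = [(t, e) for t, e in zip(tcrs, epitopes) if e != target]
    weights = epitope_balanced_weights([e for _, e in pool])
    # Weighted draw (with replacement) so that epitopes with many known
    # TCRs do not dominate the negative set.
    from_other = [t for t, _ in rng.choices(pool, weights=weights, k=n_pos)]
    # Uniform draw without replacement from the background repertoire.
    from_repertoire = rng.sample(repertoire, 4 * n_pos)
    return from_other, from_repertoire
```

With these weights, every non-target epitope contributes the same total sampling mass, so an epitope with 90 known TCRs and one with 10 are expected to supply similar numbers of negatives.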

MixTCRpred model
MixTCRpred is a transformer-based model 63 written in Python, relying on the PyTorch 78 and PyTorch Lightning 79 libraries. For each pMHC in our dataset, a specific MixTCRpred model was trained with experimentally validated TCRs and computationally generated negatives. For each TCR used as input, CDR1, CDR2 (from the V gene) and CDR3 sequences for the α and the β chains were retrieved separately. The sequences were padded, concatenated and numerically embedded using the nn.Embedding function of PyTorch (learned embedding) and a positional encoding. A transformer encoder was then used 63, followed by a dense classification layer to output the MixTCRpred binding score (with higher-scoring sequences more likely to bind).
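The input preparation step (padding each CDR, concatenating the six regions and mapping residues to integer indices) can be sketched in plain Python. This is only an illustration of the idea: the padding lengths in `MAX_LEN`, the token vocabulary and the function names are assumptions and are not taken from the MixTCRpred code.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = 0  # index reserved for padding
TOKEN = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

# Hypothetical maximum lengths per region (assumed values).
MAX_LEN = {"cdr1": 8, "cdr2": 8, "cdr3": 20}
REGIONS = ("cdr1", "cdr2", "cdr3")


def pad_and_tokenize(seq, max_len):
    """Map residues to integer indices and right-pad to max_len."""
    if len(seq) > max_len:
        raise ValueError(f"sequence longer than {max_len}: {seq}")
    ids = [TOKEN[aa] for aa in seq]
    return ids + [PAD] * (max_len - len(ids))


def encode_tcr(alpha, beta):
    """Concatenate padded CDR1, CDR2, CDR3 of the alpha and beta chains.

    alpha and beta are dicts with keys 'cdr1', 'cdr2' and 'cdr3'.
    Returns the token indices and their positions in the sequence.
    """
    tokens = []
    for chain in (alpha, beta):
        for region in REGIONS:
            tokens += pad_and_tokenize(chain[region], MAX_LEN[region])
    positions = list(range(len(tokens)))
    return tokens, positions
```

The token indices would then be passed to a learned embedding layer (for instance `torch.nn.Embedding` with `padding_idx=0`), and the positions to a positional encoding, before entering the transformer encoder.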
To achieve comparability across models, the corresponding % rank was also calculated for each input TCR. To this end, Homo sapiens αTCRs and βTCRs from iReceptor 9 were collected.
For Mus musculus, αTCRs were downloaded from iReceptor 9, while βTCRs were obtained from three different studies [80][81][82] through the immuneACCESS website. Next, treating Homo sapiens and Mus musculus separately, α and β TCR sequences were randomly paired to generate 10^6 different TCRs that were scored using each one of the 146 pMHC MixTCRpred models. The MixTCRpred scores were standardized by subtracting the mean and dividing by the standard
for individual TCRs were not provided. ERGO2.0 37,54 was not part of this validation, and was separately re-trained and tested.
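As a minimal sketch of how a raw score can be made comparable across pMHC-specific models, the snippet below standardizes a score against background scores and converts it to a % rank. The exact definition used by MixTCRpred is not given in this section, so two assumptions are made here: the standardization is a z-score using the population standard deviation of the background, and the % rank is the percentage of background TCRs scoring at least as high as the query (lower is better). The function names are hypothetical.

```python
from bisect import bisect_left
from statistics import mean, pstdev


def standardize(score, background):
    """Z-score of a raw score relative to the background scores."""
    mu = mean(background)
    sd = pstdev(background)
    return (score - mu) / sd


def percent_rank(score, sorted_background):
    """Percentage of background scores that are >= the query score.

    sorted_background must be sorted in ascending order.
    """
    n = len(sorted_background)
    # bisect_left returns the index of the first element >= score,
    # so everything from that index onwards scores at least as high.
    n_at_least = n - bisect_left(sorted_background, score)
    return 100.0 * n_at_least / n
```

For a background of one million randomly paired TCRs, sorting once and using binary search keeps each % rank lookup cheap.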

Sequence similarity
To compute the sequence similarity between a test epitope and an epitope in the training set, the two sequences were aligned with the pairwise2 align function from the biopython package 83 using the BLOSUM62 scoring matrix 84. The resulting pairwise alignment score was then divided by the score obtained by aligning the test epitope sequence with itself, so that the maximal similarity score was 1. The similarity score can assume negative values due to negative entries in the BLOSUM62 matrix. Scores closer to 1 indicate greater similarity between peptide pairs.

Peptides and pMHC multimers production
Peptides and HLA-A*02:01, GILGFVFTL and HLA-A*02:01, ELAGIGILTV multimers were produced by the Peptides and Tetramers Core Facility (PTCF) of the Department of Oncology, University of Lausanne and University Hospital of Lausanne. HPLC-purified peptides (≥90% pure) were verified by UHPLC-MS and kept lyophilized at -80°C. Peptide-MHC multimers were prepared fresh and used within a week or kept aliquoted at -80°C 85.
In brief, codon-optimized DNA sequences coding for paired α and β chains, including the mouse constant region instead of the human one, were synthesized at GeneArt (Thermo Fisher Scientific) or Telesis Bio DNA. The DNA fragments served as templates for in vitro transcription (IVT) and polyadenylation of RNA molecules as per the manufacturer's instructions (Thermo Fisher Scientific), followed by co-transfection into recipient T cells. Jurkat cells were electroporated using the Neon electroporation system (Thermo Fisher Scientific) with the following parameters: 1,325 V, 10 ms, three pulses. After overnight incubation, electroporated Jurkat cells were interrogated by pMHC-multimer staining with the following surface panel: anti-CD3 APC Fire 50 (SK7, Biolegend), anti-CD8 Pacific Blue™ (RPA-T8, BD Biosciences), anti-mouse TCRβ-constant APC (H57-597, Thermo Fisher Scientific), pMHC-multimer-PE and viability dye Aqua (Thermo Fisher Scientific).

Figure 1. Integration and curation of αβTCR-pMHC interactions reveal binding specificities for dozens of class I and class II epitopes. (A) Overview of our pipeline, including data collection, training of MixTCRpred and applications. (B) Summary of the datasets collected in this study with the corresponding number of TCRs and pMHCs. (C) Distribution of pMHCs interacting with different numbers of TCRs. 146 pMHCs have 10 or more experimentally validated binding αβTCRs, with a total of 17,715 αβTCRs. (D) Barplots showing the number of αβTCRs for the top 20 pMHCs with most experimentally validated αβTCRs. (E) Distribution of TCRs recognizing epitopes restricted to different MHC alleles.

bioRxiv preprint doi: https://doi.org/10.1101/2023.09.13.557561; this version posted September 16, 2023. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Figure 2. MixTCRpred accurately predicts TCRs recognizing specific pMHCs. (A) Illustration of the MixTCRpred model architecture. For each pMHC, MixTCRpred predicts if a TCR (encoded based on the CDR1, CDR2, and CDR3 α and β sequences) would target it. The outputs are the predicted MixTCRpred interaction score and the corresponding % rank. (B) 5-fold cross-validation average AUCs for the 146 pMHCs included in our dataset. The vertical lines show the standard deviations. The dashed lines correspond to random (red line, AUC of 0.5) and perfect predictions (black line, AUC of 1). (C) Comparison of the AUC values in leave-one-sample-out cross-validation (one point for each sample and each pMHC) with the average AUC values in the 5-fold cross-validations (one point for each pMHC). (D) Comparison of the AUC values in leave-one-study-out cross-validation (one point for each

Figure 3. MixTCRpred reveals how much predictions can be extended to unseen epitopes. (A) Comparison of the 5-fold cross-validation average AUCs to predict TCRs for epitopes already present in the training set for the pMHC-specific and the pan-epitope version of MixTCRpred. (B) Comparison of the 5-fold cross-validation average AUCs for the pMHC-specific and the pan-epitope version of MixTCRpred considering only epitopes with more than 50 TCRs. The p-values were obtained with a paired t-test. (C) AUC values of the leave-one-epitope-out validation for the pan-epitope version of MixTCRpred. (D) Sequence similarity between the test epitopes and the most similar epitope in the training set, and the

Next, to mimic the situation where pMHC multimers without previously known interacting TCRs are being used in individual donors, we trained and tested a specific DeeTCRpred model for each combination of pMHCs and donors, and compared the AUC obtained from standard 5-fold cross-validation with the median F max. A clear correlation between these two values was observed, indicating that cases (i.e., donor-pMHCs) with multiple pMHC barcodes per cell demonstrated low internal consistency in their TCR sequences (Figure 4C). These include cases with a large training set (e.g., HLA-A*03:01, KLGGALQAK, with 19753 specific T cells corresponding to 7182 different clonotypes for the 4 donors, Figure 4B-C). On the contrary, when the T-cell specificity was unambiguous, reflected by a high fraction of UMIs for a specific pMHC multimer (e.g., the HLA-A*02:01, GILGFVFTL specific T cells), accurate predictions could be achieved, indicating high-quality training data. These observations motivated us to only include data from donor-pMHCs with a median F max > 0.75 (i.e., 1704 TCR clonotypes specific for 5 pMHCs, see Methods) to train MixTCRpred.

Figure 4. MixTCRpred provides a quality control tool for scTCR-Seq data of T cells labeled with DNA-barcoded pMHC multimers. (A) Illustration of a T cell labeled with DNA-barcoded pMHC multimers, together with the distribution of UMIs for different pMHC multimers and the corresponding F max. (B) F max values for each T cell in each donor-pMHC sample. (C) Comparison between the median F max in each donor-pMHC sample and the AUC of the corresponding MixTCRpred model. The size of each point is proportional to the number of clonotypes. A black border indicates a match between the donor MHC alleles and the MHC of the multimer. The Pearson correlation and the corresponding p-value are reported.

Overall, these results indicate that MixTCRpred offers a robust framework for identifying epitope-specific chains in dual α T cells and overcomes limitations of methods such as UMI counts or exact TCR sequence matches in single α T cells.

Figure 5. MixTCRpred reveals epitope-specific chains in epitope-specific dual α T cells. (A) Overview of our pipeline to investigate epitope-specific chains in dual α T cells. (B) MixTCRpred % ranks of the two TCRs in dual α T cells specific to HLA-A*02:01, GILGFVFTL and to HLA-A*02:01, ELAGIGILTV. Chain α1 is defined as the one with the best MixTCRpred score. (C) Multimer staining of the TCRs with the predicted α chain (chain α1) and TCRs with the other α chain (chain α2) for the dual α T cells used in our experimental validation. (D) Fraction of correctly predicted α chains by MixTCRpred, considering the most expressed chains, and with exact matches in αβTCRs from single α T cells.

Figure 6. MixTCRpred reveals enrichment of TCRs specific for an immunodominant SARS-CoV-2 epitope in COVID-19-positive patients. TCR repertoires of stimulated CD4+ T cells from Bacher et al. 66 and Meckiff et al. 67, and of unstimulated CD4+ T cells from Stephenson et al. 68. (A) Fraction of T cells predicted to be TFEYVSQPFLMDLE specific for each patient. The p-values were obtained with an independent t-test. (B) Overall fraction of T cells with undetermined specificity (in gray) or predicted to be specific for the immunodominant TFEYVSQPFLMDLE epitope (in blue). (C) Clone size of each TCR clonotype.

Table 1. List of the dual α TCRs selected for experimental validation. Six dual α TCRs are specific to HLA-A*02:01, GILGFVFTL and five are specific to HLA-A*02:01, ELAGIGILTV. The α1 chain is the best MixTCRpred prediction.